132        Bioinformatics

With the FASTA sequence, we can also download the sequence dictionary file. Once we

have downloaded the FASTA sequence of the reference genome, we can index the sequence

with both samtools and bwa. The following script first creates the directory “refgenome”,

and then, in that directory, it downloads the sequence of the current human genome and

its dictionary file from GATK resource bundle and indexes the FASTA:

mkdir refgenome

cd refgenome

wget https://storage.googleapis.com/genomics-public-data/

resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta

wget https://storage.googleapis.com/genomics-public-data/

resources/broad/hg38/v0/Homo_sapiens_assembly38.dict

#or create a fasta dict for reference

f=$(ls *.fasta)

samtools faidx ${f}

bwa index ${f}

cd ..

Only human sequences are available in the GATK resource bundle. However, if you down-

load the sequence of a reference genome from a different database, you may need to create

a dictionary for that sequence using Picard software as follows:

java -jar ~/software/picard.jar \

CreateSequenceDictionary \

R=Reference.fasta \

O=Reference.dict

4.2.2.2.3  Mapping reads to the reference genome

The second step is to align the raw reads to the reference genome that we have downloaded

and indexed in the previous step. We can use the aligner of our choice. The most com-

monly used one is “bwa mem”. The following script creates the directory “sam” and saves

the SAM files that contain the read mapping information into that directory. There will be

a SAM file for each individual.

mkdir sam

cd fastq

ref=$(ls ../refgenome/*.fasta)

for i in $(ls *.fastq | rev | cut -c 13- | rev);

do

bwa mem -M -t 4 \

-R “@RG\tID:${i}\tSM:${i}” \

${ref} \

${i}_chr21.fastq > \

../sam/${i}_chr21.sam 2> ../sam/${i}_chr21.log;

done

cd ..